
Conversation

@tharittk (Collaborator) commented Jun 3, 2025

I added NCCL support to MPIVStack. Along the way, I discovered some interesting issues that may be worth discussing, so I am opening this draft PR.

Changes Made
DistributedArray.py

  • The MPIVStack adjoint operation is wrapped by the @reshape decorator, which calls add_ghost_cells() in this file. add_ghost_cells() has to be modified to support NCCL. There are two points I want to raise.
  1. the call to self._allgather(cell_fronts): cell_fronts is metadata and small in size (a list of ints, length = total number of ranks). Under the current implementation, I dispatch to NCCL if NCCL is enabled. Should we enforce always using MPI here instead?
  2. the call to self.base_comm.send: this sends the ghost cells to a peer, and the sent array can be large. Under the current code, it always uses MPI. If we want it to use NCCL, we need to implement point-to-point NCCL; luckily, CuPy NCCL supports this, so I think it is just a matter of adding more calls in _nccl.py (see the sketch after this list).
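A minimal sketch of the kind of dispatch I have in mind for the ghost-cell exchange (illustrative only: the helper name _send_ghost_cells and its signature are hypothetical, and nccl_send stands for the point-to-point wrapper to be added in _nccl.py):

def _send_ghost_cells(self, ghost_cells, dest_rank):
    # metadata (cell_fronts) can stay on MPI; the potentially large
    # ghost-cell arrays are what we may want to route through NCCL
    if self.base_comm_nccl is not None:
        # point-to-point NCCL transfer via the wrapper in _nccl.py
        nccl_send(self.base_comm_nccl, ghost_cells, dest_rank)
    else:
        # existing MPI path
        self.base_comm.send(ghost_cells, dest=dest_rank)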

VStack.py

  • the operator now takes one optional argument, base_comm_nccl - just like what we did in DistributedArray. I did not change the MPILinearOperator interface to take this argument, though; I don't have a strong opinion either way.
  • the output y of Op @ x and Op.H @ x is now initialized with the same base_comm_nccl as x, i.e., if x lives in NCCL, y lives in and communicates with NCCL too (see the short check after this list).
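As a small illustration of that propagation (a hedged sketch; the attribute access assumes the DistributedArray interface introduced here):

y = Op @ x    # forward
z = Op.H @ y  # adjoint
# both results inherit x's NCCL communicator, so any later communication
# (e.g. a ghost-cell exchange in a chained operator) also runs over NCCL
assert y.base_comm_nccl is x.base_comm_nccl
assert z.base_comm_nccl is x.base_comm_nccl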

test_stack_nccl.py

  • this closely follows its MPI counterpart test_stack.py but tests explicitly in an NCCL environment
  • script for testing the HStack operator

@mrava87 (Contributor) commented Jun 3, 2025

@tharittk, very good start!

I am going to reply to some of your comments/questions and will look more closely at your code in the next few days.

  • self._allgather(cell_fronts): I may have said otherwise on Slack, but I agree that this is only dealing with indices, so I would leave it to MPI even when we use NCCL, as we did for other similar operations in the DistributedArray PR.
  • send/recv: I agree that we should follow the same approach, have their implementation in the _nccl file, and then dispatch to the correct one based on whether base_comm_nccl is present or not... and yes, we should definitely allow this to use NCCL if we have a base_comm_nccl, because this could be sending larger arrays and so become a bottleneck that NCCL can speed up.
  • I also don't have strong feelings about whether we should add base_comm_nccl to MPILinearOperator or not... we need the MPI base comm to get rank and size, but other than that we don't use it, so it is probably not worth also passing and storing the NCCL communicator. We will, however, need to change MPILinearOperator a bit (here or in a later PR), as MPILinearOperator is used to wrap PyLops operators that we want to be applied identically on every rank to a DistributedArray with BROADCAST partition. So we will need to modify how we create y (currently y = DistributedArray(global_shape=self.shape[0], ...)), passing base_comm_nccl=x.base_comm_nccl to ensure that it is not lost in y in case the next operator in the chain has some form of communication (say, e.g., we apply a FirstDerivative); there, I think it would be good for the input DistributedArray to carry the correct base_comm_nccl (in case we end up in some situation where we want to do some checks)... a sketch of that creation follows below.
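For concreteness, a hedged sketch of what that modified creation could look like (keyword names other than global_shape and base_comm_nccl are assumptions here, not the exact MPILinearOperator code):

y = DistributedArray(global_shape=self.shape[0],
                     base_comm_nccl=x.base_comm_nccl,  # carried over from the input
                     partition=Partition.BROADCAST,    # operator applied identically on every rank
                     engine=x.engine,
                     dtype=self.dtype)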

@tharittk (Collaborator, Author) commented Jun 4, 2025

I see. For the third point about passing base_comm_nccl=x.base_comm_nccl: currently I test by calling Op @ x, which calls the concrete instance of MPIVStack, and thus nccl_comm is passed (even though I did not change the MPILinearOperator interface). So when I check which collective calls y operates with, it says NCCL.

If somehow Op @ x is called and Op is an instance of MPILinearOperator, nccl_comm is lost because y then takes the default value of base_comm_nccl=None. That is something I did not catch.

@tharittk marked this pull request as ready for review on June 8, 2025
@tharittk (Collaborator, Author) commented Jun 8, 2025

The most recent commits reflect some changes I want to point out:

  • Removal of the base_comm_nccl argument from the concrete MPILinearOperator class. The argument seems unnecessary, and that information can be taken from the operand x whenever _matvec or _rmatvec is called.
  • Addition of nccl_send and nccl_recv. To test this implementation, the MPIVStack test case is not enough: it requires a test where cell_fronts is not zero, i.e., we are not sending an empty buffer. So I added test_blockdiag_nccl.py. In this file, the StackedBDiag has a FirstDerivative inside it; this first derivative triggers nccl_send and nccl_recv with meaningful (non-empty) ghost cells. A sketch of these wrappers follows below.
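A rough sketch of what these wrappers might look like on top of CuPy's NCCL bindings (the names nccl_send/nccl_recv are the ones added in this PR, but the bodies below are illustrative and assume float32 buffers and the default stream):

import cupy as cp
from cupy.cuda import nccl


def nccl_send(nccl_comm, send_buf, dest):
    # point-to-point send of a contiguous CuPy array to rank `dest`
    nccl_comm.send(send_buf.data.ptr, send_buf.size, nccl.NCCL_FLOAT32,
                   dest, cp.cuda.Stream.null.ptr)


def nccl_recv(nccl_comm, recv_buf, source):
    # receive into a preallocated CuPy array from rank `source`
    nccl_comm.recv(recv_buf.data.ptr, recv_buf.size, nccl.NCCL_FLOAT32,
                   source, cp.cuda.Stream.null.ptr)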

@mrava87 (Contributor) left a comment

@tharittk very good!

Everything makes sense to me and I agree with the code changes. I just left some minor suggestions. After you have taken those into account, I think we can merge this PR 😄

One more minor thing: as you progress, don't forget to keep the table in gpu.rst up to date.

local_shapes = None
global_shape = getattr(self, "dims")
arr = DistributedArray(global_shape=global_shape,
base_comm_nccl=x.base_comm_nccl,
Contributor commented:

Since we are changing this, I think it would be safe to also pass base_comm=x.base_comm... I think in the past this never led to any issue as we probably always used MPI.COMM_WORLD, but it's good not to assume this will always be the case 😄 (@rohanbabbar04, agree?)

Collaborator commented:

Suggested change
- base_comm_nccl=x.base_comm_nccl,
+ base_comm=x.base_comm,
+ base_comm_nccl=x.base_comm_nccl,

__all__ = [
"initialize_nccl_comm",
"nccl_split",
"nccl_allgather",
Contributor commented:

I think it may be good to add all nccl_* methods to the Utils section of https://github.com/PyLops/pylops-mpi/blob/main/docs/source/api/index.rst

Collaborator (Author) commented:

Alright, I can do that. Maybe in another PR?

Contributor commented:

Sounds good. If it is very small like this, we can go for the same PR; if it is something a bit more substantial, like the changes you made previously, it is good practice to have a separate documentation-only PR 😄

@tharittk (Collaborator, Author) commented:

Since the PR involves adding NCCL support to _send and _recv, which directly impact add_ghost_cells, I decided to also change FirstDerivative and SecondDerivative so that they support NCCL. This implicitly makes Gradient and Laplacian work as well - those two require no code changes. A usage sketch follows below.
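A hypothetical end-to-end usage sketch of what this enables (module paths, the initialize_nccl_comm helper location, and the MPIFirstDerivative arguments are assumptions and may differ from the actual code):

import cupy as cp
import pylops_mpi
from pylops_mpi.utils._nccl import initialize_nccl_comm  # import path assumed

nccl_comm = initialize_nccl_comm()

n = 100
x = pylops_mpi.DistributedArray(global_shape=n,
                                base_comm_nccl=nccl_comm,
                                engine="cupy")
x[:] = cp.arange(x.local_shape[0], dtype="float64")

# the ghost-cell exchange inside the derivative now runs through
# nccl_send/nccl_recv rather than MPI send/recv
Dop = pylops_mpi.MPIFirstDerivative(dims=(n,), dtype="float64")
y = Dop @ x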

@tharittk changed the title from "support nccl in add_ghost_cells and NCCL-VStack" to "add NCCL support to add_ghost_cells and operators in /basicoperators" on Jun 10, 2025
@mrava87 (Contributor) left a comment

@tharittk I have reviewed the new additions and they look great to me!

There are still a few conversations to resolve (for one of them we can hopefully get @rohanbabbar04's opinion), and I would like to hear from @rohanbabbar04 if he has any general comments; after that I will merge.

@rohanbabbar04 (Collaborator) commented Jun 14, 2025

Sorry, I missed your messages. I’ll review the PR in a day or two and share my comments.

@rohanbabbar04 (Collaborator) left a comment

In all cases, I would use base_comm = x.base_comm along with base_comm_nccl = x.base_comm_nccl everywhere, since we handle both cases. This should not cause any issues, as base_comm changes to MPI.COMM_WORLD when base_comm_nccl is not None.

local_shapes = None
global_shape = getattr(self, "dims")
arr = DistributedArray(global_shape=global_shape,
base_comm_nccl=x.base_comm_nccl,
Collaborator commented:

Suggested change
- base_comm_nccl=x.base_comm_nccl,
+ base_comm=x.base_comm,
+ base_comm_nccl=x.base_comm_nccl,

@tharittk (Collaborator, Author) commented:

Thanks for the review @rohanbabbar04 @mrava87!
I have pushed the latest commit to reflect the suggested changes.

@mrava87 (Contributor) commented Jun 17, 2025

Great, I am going to merge this. @tharittk great work 😄

@mrava87 merged commit 9e25d8e into PyLops:main on Jun 17, 2025
61 checks passed